We propose a certainty-equivalence scheme for adaptive control of scalar linear systems subject to i.i.d. Gaussian disturbances and bounded control input constraints, without requiring prior bounds on the system parameters or knowledge of the control direction. Assuming that the system is at worst marginally stable, a mean-square bound on the closed-loop system state is proven. Finally, numerical examples are presented to illustrate our results.
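To make the control structure concrete, here is a minimal simulation sketch of a certainty-equivalence loop of this kind, assuming scalar dynamics x_{t+1} = a x_t + b u_t + w_t, a batch least-squares estimator, and a saturated deadbeat-style control law; these choices are illustrative and not the scheme analyzed in the abstract.

```python
import numpy as np

# Illustrative certainty-equivalence adaptive control of a scalar system
#   x_{t+1} = a * x_t + b * u_t + w_t,  w_t ~ N(0, sigma^2),  |u_t| <= u_max.
# The true (a, b), the estimator, and the control law are assumptions for this
# sketch, not the scheme proven mean-square bounded in the abstract.
rng = np.random.default_rng(0)
a_true, b_true, sigma, u_max, T = 1.0, 0.5, 0.1, 1.0, 200

x, X, U, Y = 0.0, [], [], []
theta = np.array([0.0, 0.1])          # current estimate of (a, b)
for t in range(T):
    a_hat, b_hat = theta
    # Certainty equivalence: treat the estimate as the truth (deadbeat law),
    # then saturate to respect the input constraint.
    u = -a_hat / b_hat * x if abs(b_hat) > 1e-3 else 0.0
    u = float(np.clip(u, -u_max, u_max))
    x_next = a_true * x + b_true * u + sigma * rng.standard_normal()
    X.append(x); U.append(u); Y.append(x_next)
    # Batch least squares over the history (recursive LS would be typical).
    Phi = np.column_stack([X, U])
    theta, *_ = np.linalg.lstsq(Phi, np.array(Y), rcond=None)
    x = x_next

print("estimated (a, b):", theta, " final |x|:", abs(x))
```

The interesting question in the abstract is whether such a loop keeps the state mean-square bounded without prior parameter bounds or a known control direction; the sketch only illustrates the control structure, not the proof.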
Humans have internal models of robots (like their physical capabilities), the world (like what will happen next), and their tasks (like a preferred goal). However, human internal models are not always perfect: for example, it is easy to underestimate a robot's inertia. Nevertheless, these models change and improve over time as humans gather more experience. Interestingly, robot actions influence what this experience is, and therefore influence how people's internal models change. In this work we take a step towards enabling robots to understand the influence they have, leverage it to better assist people, and help human models more quickly align with reality. Our key idea is to model the human's learning as a nonlinear dynamical system which evolves the human's internal model given new observations. We formulate a novel optimization problem to infer the human's learning dynamics from demonstrations that naturally exhibit human learning. We then formalize how robots can influence human learning by embedding the human's learning dynamics model into the robot planning problem. Although our formulations provide concrete problem statements, they are intractable to solve in full generality. We contribute an approximation that sacrifices the complexity of the human internal models we can represent, but enables robots to learn the nonlinear dynamics of these internal models. We evaluate our inference and planning methods in a suite of simulated environments and an in-person user study, where a 7DOF robotic arm teaches participants to be better teleoperators. While influencing human learning remains an open problem, our results demonstrate that this influence is possible and can be helpful in real human-robot interaction.
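One hedged way to write down the two problems sketched above follows; the notation is mine (theta_t for the human's internal model, f for their learning dynamics, theta* for the model the robot would like them to converge to) and only illustrates the structure, not the paper's exact formulation.

```latex
% Hedged formalization (notation assumed): \theta_t is the human's internal
% model, o_t the observation induced by the robot's action a^R_t in state s_t,
% f the human's learning dynamics, \pi_H the human's policy given \theta_t,
% and a^H_t the human actions observed in the demonstrations.
\begin{align*}
  \text{human learning:}\quad & \theta_{t+1} = f(\theta_t, o_t) \\
  \text{inference:}\quad & \hat f = \arg\min_{f}\; \sum_{t}
      -\log \pi_H\!\left(a^H_t \mid s_t, \theta_t\right)
      \quad \text{s.t.}\;\; \theta_{t+1} = f(\theta_t, o_t) \\
  \text{influence planning:}\quad & \max_{a^R_{0:T}}\;
      -\big\lVert \theta_T - \theta^{*} \big\rVert^2
      + \lambda \sum_t r(s_t, a^R_t)
      \quad \text{s.t.}\;\; \theta_{t+1} = \hat f(\theta_t, o_t)
\end{align*}
```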
When robots learn reward functions using high-capacity models that take raw state directly as input, they need to learn both a representation of what matters in the task -- the task "features" -- and how to combine these features into a single objective. If they try to do both at once from input designed to teach the full reward function, it is easy to end up with a representation that contains spurious correlations in the data, which fails to generalize to new settings. Instead, our ultimate goal is to enable robots to identify and isolate the causal features that people actually care about and use when they represent states and behavior. Our idea is that we can tune into this representation by asking users what behaviors they consider similar: behaviors will be similar if the features that matter are similar, even if low-level behavior is different; conversely, behaviors will be different if even one of the features that matter differs. This, in turn, is what enables the robot to disambiguate between what needs to go into the representation versus what is spurious, as well as what aspects of behavior can be compressed together versus not. The notion of learning representations based on similarity has a nice parallel in contrastive learning, a self-supervised representation learning technique that maps visually similar data points to similar embeddings, where similarity is defined by a designer through data augmentation heuristics. In contrast, to learn the representations that people actually use, and thereby their preferences and objectives, we use their definition of similarity. In simulation as well as in a user study, we show that learning through such similarity queries leads to representations that, while far from perfect, are indeed more generalizable than self-supervised and task-input alternatives.
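As a concrete illustration of learning from similarity queries, here is a minimal sketch in the spirit of contrastive learning, assuming trajectories are flattened into fixed-length vectors and the human answers "similar"/"different" queries; the encoder, dimensions, and margin loss are illustrative choices rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch: learn a feature embedding from human similarity queries.
# Each query supplies two behaviors (flattened trajectories) and a human
# answer y = 1 if they are "similar", 0 otherwise.
class TrajEncoder(nn.Module):
    def __init__(self, traj_dim=60, feat_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(traj_dim, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim))

    def forward(self, traj):
        return self.net(traj)

def similarity_query_loss(encoder, traj_a, traj_b, y, margin=1.0):
    """Pull embeddings together when the human says 'similar', push them
    apart (up to a margin) when they say 'different'."""
    d = F.pairwise_distance(encoder(traj_a), encoder(traj_b))
    return (y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)).mean()

encoder = TrajEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
traj_a, traj_b = torch.randn(32, 60), torch.randn(32, 60)   # placeholder data
y = torch.randint(0, 2, (32,)).float()                      # human answers
loss = similarity_query_loss(encoder, traj_a, traj_b, y)
opt.zero_grad(); loss.backward(); opt.step()
```

The key difference from standard contrastive learning is simply where y comes from: human similarity judgments instead of data-augmentation heuristics.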
Inferring reward functions from human behavior is at the center of value alignment - aligning AI objectives with what we, humans, actually want. But doing so relies on models of how humans behave given their objectives. After decades of research in cognitive science, neuroscience, and behavioral economics, obtaining accurate human models remains an open research topic. This begs the question: how accurate do these models need to be in order for the reward inference to be accurate? On the one hand, if small errors in the model can lead to catastrophic error in inference, the entire framework of reward learning seems ill-fated, as we will never have perfect models of human behavior. On the other hand, if as our models improve, we can have a guarantee that reward accuracy also improves, this would show the benefit of more work on the modeling side. We study this question both theoretically and empirically. We do show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward. However, and arguably more importantly, we are also able to identify reasonable assumptions under which the reward inference error can be bounded linearly in the error in the human model. Finally, we verify our theoretical insights in discrete and continuous control tasks with simulated and human data.
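For context, the following writes out a standard noisy-rational (Boltzmann) human model of the kind this line of work relies on, together with the rough shape of the robustness statement described above; the symbols, the model distance epsilon, and the constant C are illustrative, not the paper's exact theorem.

```latex
% Standard Boltzmann-rational choice model over trajectories \xi with features
% \phi, reward parameters \theta, and rationality coefficient \beta, plus the
% maximum-likelihood reward inference built on it:
\begin{align*}
  P(\xi \mid \theta) &= \frac{\exp\!\big(\beta\,\theta^{\top}\phi(\xi)\big)}
                            {\sum_{\xi'} \exp\!\big(\beta\,\theta^{\top}\phi(\xi')\big)},
  &
  \hat\theta &= \arg\max_{\theta} \sum_{i} \log P(\xi_i \mid \theta).
\end{align*}
% The abstract's positive result has the flavor: if the true human's choice
% distribution is within \epsilon of the assumed model (in a suitable metric)
% and reasonable identifiability assumptions hold, then
% \|\hat\theta - \theta^{*}\| \le C\,\epsilon for a problem-dependent C; the
% adversarial construction shows this fails without such assumptions.
```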
Recent work in sim2real has successfully enabled robots to act in physical environments by training in simulation with a diverse ''population'' of environments (i.e. domain randomization). In this work, we focus on enabling generalization in assistive tasks: tasks in which the robot is acting to assist a user (e.g. helping someone with motor impairments with bathing or with scratching an itch). Such tasks are particularly interesting relative to prior sim2real successes because the environment now contains a human who is also acting. This complicates the problem because the diversity of human users (instead of merely physical environment parameters) is more difficult to capture in a population, thus increasing the likelihood of encountering out-of-distribution (OOD) human policies at test time. We advocate that generalization to such OOD policies benefits from (1) learning a good latent representation for human policies that test-time humans can accurately be mapped to, and (2) making that representation adaptable with test-time interaction data, instead of relying on it to perfectly capture the space of human policies based on the simulated population only. We study how to best learn such a representation by evaluating on purposefully constructed OOD test policies. We find that sim2real methods that encode environment (or population) parameters and work well in tasks that robots do in isolation, do not work well in assistance. In assistance, it seems crucial to train the representation based on the history of interaction directly, because that is what the robot will have access to at test time. Further, training these representations to then predict human actions not only gives them better structure, but also enables them to be fine-tuned at test-time, when the robot observes the partner act. https://adaptive-caregiver.github.io.
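A minimal sketch of the representation idea, assuming a recurrent encoder over the interaction history with an action-prediction head that can be fine-tuned online when the robot observes the partner act; the architecture, dimensions, and variable names are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

# Hedged sketch: encode the interaction history into a latent describing the
# human partner, and attach a head that predicts the human's next action so
# the latent can be adapted at test time.
class PartnerEncoder(nn.Module):
    def __init__(self, obs_act_dim=16, latent_dim=8, human_act_dim=4):
        super().__init__()
        self.rnn = nn.GRU(obs_act_dim, 32, batch_first=True)
        self.to_latent = nn.Linear(32, latent_dim)
        self.act_head = nn.Linear(latent_dim + obs_act_dim, human_act_dim)

    def forward(self, history):                  # history: (B, T, obs_act_dim)
        _, h = self.rnn(history)
        return self.to_latent(h[-1])             # latent z: (B, latent_dim)

    def predict_human_action(self, history):
        z = self.forward(history)
        return self.act_head(torch.cat([z, history[:, -1]], dim=-1))

enc = PartnerEncoder()
opt = torch.optim.Adam(enc.parameters(), lr=1e-4)
history = torch.randn(1, 20, 16)                 # placeholder interaction history
observed_human_act = torch.randn(1, 4)           # what the real partner just did
# Test-time fine-tuning: a few gradient steps on the action-prediction loss.
for _ in range(5):
    loss = nn.functional.mse_loss(enc.predict_human_action(history),
                                  observed_human_act)
    opt.zero_grad(); loss.backward(); opt.step()
```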
One of the most successful paradigms for reward learning uses human feedback in the form of comparisons. Although these methods hold promise, human comparison labeling is expensive and time-consuming, constituting a major bottleneck to their broader applicability. Our insight is that we can greatly improve how effectively human time is used in these approaches by batching comparisons together, rather than having the human label each comparison individually. To do so, we leverage dimensionality-reduction and visualization techniques to provide the human with an interactive GUI displaying the state space, in which the user can label subportions of the state space. Across some simple MuJoCo tasks, we show that this high-level approach holds promise and is able to greatly increase the performance of the resulting agents, given the same amount of human labeling time.
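A rough sketch of the batching idea, assuming visited states are projected to 2D for display and a hypothetical box selection in the GUI stands in for the user's region label; the expansion into labels and the simple reward fit are illustrative stand-ins, not the paper's tool.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Hedged sketch: one GUI interaction labels a whole region of state space,
# which expands into thousands of state-level labels at once.
states = np.random.randn(5000, 17)            # e.g. MuJoCo observations (placeholder)
emb = PCA(n_components=2).fit_transform(states)

# Suppose the user drew a box around "good" states in the 2D view
# (hypothetical coordinates).
in_box = (emb[:, 0] > 0.5) & (emb[:, 1] > 0.0)
labels = in_box.astype(float)                 # 1 = preferred region, 0 = rest

# Fit a simple state-reward model from the batch of labels.
reward_model = LogisticRegression(max_iter=1000).fit(states, labels)
reward = reward_model.decision_function(states)   # scalar reward per state
```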
Power system state estimation is confronted with different types of anomalies. These may include bad data caused by gross measurement errors or communication system failures. Depending on the state estimation method implemented, sudden changes in load or generation can also be treated as anomalies. Moreover, when the grid is considered as a cyber-physical system, state estimation becomes vulnerable to false data injection attacks. Existing anomaly classification methods cannot accurately classify (distinguish between) the three aforementioned types of anomalies, especially when discriminating between sudden load changes and false data injection attacks. This paper proposes a new algorithm for detecting the presence of an anomaly, classifying its type, and identifying its origin, i.e., the location of a sudden load change or the state variables targeted by a false data injection attack. The algorithm combines analytical and machine learning (ML) methods. The first stage exploits an analytical approach to detect the presence of an anomaly by combining $\chi^2$ detection indices. The second stage utilizes ML for classification of the anomaly type and identification of its origin, in particular for the discrimination between sudden load changes and false data injection attacks. The proposed ML-based method is trained to be independent of the network configuration, which eliminates the need to retrain the algorithm after a network topology change. The results obtained by implementing the proposed algorithm on the IEEE 14-bus test system demonstrate its accuracy and effectiveness.
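A minimal sketch of the first, analytical stage, assuming a linear(ized) measurement model and a single chi-square test on the weighted-least-squares residuals; the measurement matrix, covariance, and significance level below are illustrative assumptions, not the paper's combined detection indices.

```python
import numpy as np
from scipy.stats import chi2

# Hedged sketch: chi-square test on WLS residuals of a linearized state estimate.
def chi2_anomaly_flag(z, H, R, alpha=0.01):
    """z: measurements, H: measurement matrix, R: measurement covariance."""
    W = np.linalg.inv(R)
    x_hat = np.linalg.solve(H.T @ W @ H, H.T @ W @ z)   # WLS state estimate
    r = z - H @ x_hat                                   # residuals
    J = float(r.T @ W @ r)                              # chi-square detection index
    dof = H.shape[0] - H.shape[1]                       # m measurements - n states
    threshold = chi2.ppf(1.0 - alpha, dof)
    return J > threshold, J, threshold

# Tiny example: 6 measurements of 3 state variables, one corrupted measurement.
rng = np.random.default_rng(1)
H = rng.standard_normal((6, 3))
x_true = np.array([1.0, -0.5, 0.2])
R = 0.01 * np.eye(6)
z = H @ x_true + rng.multivariate_normal(np.zeros(6), R)
z[2] += 1.0                                             # injected bad data
print(chi2_anomaly_flag(z, H, R))
```

The second, ML-based stage (classifying the anomaly type and locating its origin) would consume features such as these residuals rather than raw topology-dependent quantities.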
When inferring reward functions from human behavior (be it demonstrations, comparisons, physical corrections, or e-stops), it has proven useful to model the human as making noisy-rational choices, with a "rationality coefficient" capturing how much noise or entropy we expect to see in the human's behavior. Many existing works choose to fix this coefficient regardless of the type or quality of the human feedback. However, in some settings, giving a demonstration may be much harder than answering a comparison query. In that case, we should expect to see more noise or suboptimality in demonstrations than in comparisons, and we should interpret the feedback accordingly. In this work, we advocate that grounding the rationality coefficient in real data for each feedback type, rather than assuming a default value, has a significant positive effect on reward learning. We test this in experiments with simulated feedback as well as in a user study. We find that when learning from a single feedback type, overestimating human rationality can have dire consequences for reward accuracy and regret. Further, we find that the rationality level affects the informativeness of each feedback type: surprisingly, demonstrations are not always the most informative, and when the human behaves very suboptimally, comparisons actually become more informative, even at the same rationality level. Moreover, when the robot gets to decide which feedback type to ask for, it can gain a large advantage by accurately modeling the rationality level of each type. Ultimately, our results emphasize the importance of paying attention to the assumed rationality level, not only when learning from a single feedback type, but especially when the agent learns from multiple feedback types.
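A small sketch of what grounding the rationality coefficient in data can look like, assuming Boltzmann-rational pairwise comparisons and a one-parameter maximum-likelihood fit of beta; the data format (reward of the chosen versus rejected option) and the model are illustrative simplifications.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hedged sketch: fit a rationality coefficient beta for one feedback type by
# maximum likelihood on observed choices under a Boltzmann (Luce) choice model.
def fit_beta(r_chosen, r_rejected):
    r_chosen, r_rejected = np.asarray(r_chosen), np.asarray(r_rejected)
    def nll(beta):
        # P(choose A over B) = exp(beta rA) / (exp(beta rA) + exp(beta rB))
        logits = beta * (r_chosen - r_rejected)
        return np.sum(np.logaddexp(0.0, -logits))      # negative log-likelihood
    return minimize_scalar(nll, bounds=(1e-4, 100.0), method="bounded").x

# Simulated comparisons: a noisier feedback channel yields a smaller beta.
rng = np.random.default_rng(0)
rA, rB = rng.uniform(0, 1, 500), rng.uniform(0, 1, 500)
beta_true = 3.0
p_choose_A = 1.0 / (1.0 + np.exp(-beta_true * (rA - rB)))
chose_A = rng.random(500) < p_choose_A
r_chosen = np.where(chose_A, rA, rB)
r_rejected = np.where(chose_A, rB, rA)
print("estimated beta:", fit_beta(r_chosen, r_rejected))
```

Fitting a separate beta per feedback type (demonstrations, comparisons, corrections, e-stops) is what lets the learner weight noisier channels less heavily, as argued above.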
Publicly available data for training diabetic retinopathy classifiers are imbalanced. Generative adversarial networks can successfully synthesize retinal fundus images. For synthetic images to be beneficial, they must be of high quality and diverse. Currently, several evaluation metrics are used to assess the quality and diversity of images synthesized by generative adversarial networks. This work is the first such empirical assessment of the suitability of the evaluation metrics used in the literature for evaluating generative adversarial networks that generate retinal fundus images in the context of diabetic retinopathy. The Frechet Inception Distance, peak signal-to-noise ratio, and cosine distance are assessed for their ability to evaluate the quality and diversity of synthetic proliferative diabetic retinopathy images. A quantitative analysis is performed to enable an improved method for selecting synthetic images with which to augment a classifier's training dataset. The results show that the Frechet Inception Distance is suitable for evaluating the diversity of synthetic images and for identifying whether an image has features corresponding to its class label. The peak signal-to-noise ratio is suitable for indicating whether a synthetic image exhibits valid diabetic retinopathy features and whether those features correspond to its class label. These results demonstrate the importance of performing such empirical evaluations, especially in the context of biomedical domains where the generated images are intended to be utilized in applied settings.
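For reference, minimal implementations of two of the three metrics discussed (peak signal-to-noise ratio and cosine distance); the Frechet Inception Distance additionally requires a pretrained Inception network and is omitted here. Array shapes, dtypes, and the placeholder images are assumptions for this sketch.

```python
import numpy as np

# Hedged sketch of two image-evaluation metrics for uint8 fundus images.
def psnr(real, synthetic, max_val=255.0):
    """Peak signal-to-noise ratio between a real and a synthetic image."""
    mse = np.mean((real.astype(np.float64) - synthetic.astype(np.float64)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def cosine_distance(feat_a, feat_b):
    """Cosine distance between two image feature vectors (e.g. CNN embeddings)."""
    a, b = np.ravel(feat_a), np.ravel(feat_b)
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

real = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)    # placeholder
synth = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)   # placeholder
print(psnr(real, synth), cosine_distance(real.flatten(), synth.flatten()))
```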
How can we train an assistive human-machine interface (e.g., an electromyography-based limb prosthesis) to map a user's raw command signals into the actions of a robot or computer, if there is no prior mapping, we cannot ask the user for supervision in the form of action labels or reward feedback, and we do not know in advance what task the user is trying to accomplish? The key idea in this paper is that, regardless of the task, an interface is more intuitive when the user's commands are less noisy. We formalize this idea as a completely unsupervised objective for optimizing interfaces: the mutual information between the user's command signals and the induced state transitions in the environment. To evaluate whether this mutual-information score can distinguish effective interfaces from ineffective ones, we conduct an observational study on 540K examples of users operating various keyboard and eye-gaze interfaces for typing, controlling simulated robots, and playing video games. The results show that our mutual-information score is predictive of ground-truth task completion metrics across these domains, with an average Spearman rank correlation of 0.43. In addition to offline evaluation of existing interfaces, we also use the unsupervised objective to learn an interface from scratch: we randomly initialize the interface, let the user attempt to perform their desired task with it, measure the mutual-information score, and update the interface to maximize mutual information through reinforcement learning. We evaluate our method through a user study with 12 participants who perform a 2D cursor control task using a perturbed mouse, and an experiment with one user playing the Lunar Lander game with hand gestures. The results show that we can learn an interface from scratch within 30 minutes, without any user supervision or prior knowledge of the task.
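A crude sketch of the mutual-information score, assuming commands and state transitions are discretized by per-dimension binning before computing mutual information; the binning estimator, dimensions, and placeholder data are illustrative, not the estimator used in the paper.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Hedged sketch: estimate MI between (discretized) user commands and the state
# transitions an interface induces.  Real command/state spaces are continuous,
# so this histogram-style estimator is only illustrative.
def interface_mi_score(commands, state_transitions, bins=8):
    """commands: (N, d_c) raw command signals; state_transitions: (N, d_s)
    next_state - state.  Returns a mutual-information estimate in nats."""
    def discretize(x):
        # Bin each dimension, then hash the per-dimension bins into one label.
        edges = [np.quantile(x[:, j], np.linspace(0, 1, bins + 1)[1:-1])
                 for j in range(x.shape[1])]
        digit = np.stack([np.digitize(x[:, j], edges[j])
                          for j in range(x.shape[1])], axis=1)
        return np.array([hash(tuple(row)) for row in digit])
    return mutual_info_score(discretize(commands), discretize(state_transitions))

rng = np.random.default_rng(0)
cmds = rng.standard_normal((2000, 2))
good_transitions = cmds + 0.1 * rng.standard_normal((2000, 2))   # intuitive mapping
bad_transitions = rng.standard_normal((2000, 2))                 # commands ignored
print(interface_mi_score(cmds, good_transitions),    # high score
      interface_mi_score(cmds, bad_transitions))     # near zero
```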